In an era where technology is rapidly advancing, the integration of 5G connectivity into smartphones has become a pivotal feature for consumers and manufacturers alike. The primary focus of this project lies in predicting whether a given smartphone possesses 5G connectivity or not.
The dataset used in this project, called “Smartphones_Dataset,” is sourced from Kaggle (https://www.kaggle.com/datasets/informrohit1/smartphones-dataset), and was updated in 2024. The Smartphones dataset comprises 210 rows and 26 columns, containing information scraped from the web on various attributes of different smartphone models. We will be implementing multiple machine learning techniques to yield the most accurate model for this binary classification problem.
5G connectivity refers to the fifth-generation wireless technology that offers significantly faster data speeds, greater reliability, and enhanced capacity compared to previous generations. Its importance lies in its potential to transform many aspects of society and industry, with peak data rates often cited as up to 100 times faster than those of 4G networks. By providing a foundation for innovation and connectivity, 5G enables new possibilities and improves quality of life for individuals and communities alike.
My past summer internships focused on enhancing the design of RF filters, which gave me valuable insight into the intricate world of wireless communication systems. Witnessing the innovation and efficiency in the field of RF engineering firsthand sparked my interest in exploring this topic for my project. By combining my internship experience with machine learning techniques, I aim to gain a better understanding of the ever-evolving technology landscape.
To begin, we will load our dataset and do some initial data manipulation and cleaning. From there, we will perform exploratory data analysis to gain further insight into our variables and their relevance. Our goal is to use the predictor variables to predict a binary response, has_5g, where the class “Yes” indicates that the smartphone is a 5G model. We will then perform a training/test split on our data, make a recipe, and set folds for the 10-fold cross-validation we will implement. The models fit to the training data are Random Forest, K-Nearest Neighbors, Decision Tree, Support Vector Machine, Logistic Regression, Lasso Regression, Linear Discriminant Analysis, and Quadratic Discriminant Analysis. Whichever model performs best will then be applied to our testing data set so we can analyze how effective it is.
# Loading packages: tidyverse (dplyr, tidyr, ggplot2), tidymodels, and janitor for clean_names()
library(tidyverse)
library(tidymodels)
library(janitor)
# Loading the data
smartphones_data <- read.csv("/Users/kira1/Downloads/smartphones_cleaned_v6.csv")
# Cleaning variable names
smartphones_clean <- smartphones_data %>%
  clean_names()
# Mutating Data
smartphones_clean <- smartphones_clean %>%
  mutate(has_5g = ifelse(has_5g == "True", "Yes", "No")) %>%
  mutate(has_nfc = ifelse(has_nfc == "True", 1, 0)) %>%
  mutate(has_ir_blaster = ifelse(has_ir_blaster == "True", 1, 0))
smartphones_clean$has_5g <- factor(smartphones_clean$has_5g)
First, we load the dataset from a CSV file and clean the variable
names for consistency and readability. I decided to modify certain
variables in the dataset: has_5g is converted to a factor
with levels “Yes” and “No” to indicate the presence or absence of 5G
connectivity, while has_nfc and has_ir_blaster
are converted to binary numeric variables so they can be used in our
recipe later on without being dummy coded.
Next, we want to identify if the dataset has any missing values.
# removing NA values
smartphones_clean <- drop_na(smartphones_clean)
# removing columns
smartphones_clean <- smartphones_clean %>%
  select(-c(extended_upto, model, fast_charging_available, os,
            extended_memory_available))
## [1] 0
From our plot above we can see that there are indeed missing values
present. However, since the overall percentage of missing data is very
small (3.4%), I decided to remove all rows with missing data. I also
removed the column extended_upto, since almost half of its values are
missing, along with the columns model, fast_charging_available, os, and
extended_memory_available, since their respective values seem to be
mostly the same throughout the data set after removing missing data.
Now we can proceed, as there are no longer any missing values!
Let’s take a quick look at our dataset by displaying the first 6 rows before we begin using the data.
To gain deeper insights into the distribution of our response variable and the relationships with our predictors, we will create an output variable plot and a correlation matrix. These visualizations will help us identify any potential correlations between our predictor variables. We will also generate visualization plots to observe the impact of specific variables of interest on our response variable.
# 5G Connectivity Distribution
smartphones_clean %>%
  ggplot(aes(x = has_5g, fill = has_5g)) +
  geom_bar(color = "black") +
  theme_minimal() +
  labs(
    title = "Count of Smartphones with 5G",
    x = "5G",
    y = "Count"
  ) +
  scale_fill_manual(values = c("#AEE1F0", "#A8E9BE"))
As an initial step, we can explore the distribution of smartphones with 5G connectivity. The number of 5G models is lower, indicating that this advanced feature is still emerging in the market and has not yet reached widespread adoption. This lower prevalence of 5G-enabled devices may be due to several factors, including higher production costs and limited availability of 5G infrastructure. However, I expect more smartphones to be 5G models in the near future, as demand is increasing.
Next, we will create a correlation plot to visualize how related our variables are.
We exclude categorical variables from the correlation matrix plot
since correlation coefficients are calculated for numeric variables.
Looking at the correlation matrix plot, we notice that most of our
variables are positively correlated with each other, and where they are
negatively correlated the correlation is very weak.
rating is positively correlated with all other variables,
most strongly with ram_capacity. Additionally, ram_capacity has a
strong positive correlation with internal_memory. Although weak,
the correlation between has_nfc and has_ir_blaster is the strongest
negative one.
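As a minimal sketch of how such a correlation matrix can be computed, the snippet below keeps only the numeric columns and calls base R's cor(). The toy data frame `phones` and its values are hypothetical, made up for illustration rather than taken from the dataset.

```r
# Hypothetical toy data; values are made up for illustration only
phones <- data.frame(
  brand_name      = c("a", "b", "c", "d", "e"),
  ram_capacity    = c(4, 6, 8, 8, 12),
  internal_memory = c(64, 128, 128, 256, 256),
  has_nfc         = c(0, 0, 1, 1, 1)
)

# Keep numeric columns only, since Pearson correlation needs numeric input
numeric_cols <- phones[sapply(phones, is.numeric)]

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(numeric_cols), 2)
print(cor_matrix)
```

The result is a symmetric matrix with ones on the diagonal, which is the kind of object a correlation plot is drawn from.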
To check that our models will not run into multicollinearity problems,
we can use the vif() function to calculate the Variance Inflation
Factor for each predictor variable in a model fitted with glm().
library(car)  # vif() is assumed to come from the car package
# Temporarily make 'has_5g' numeric (factor levels "No"/"Yes" become 1/2)
smartphones_clean$has_5g <- as.numeric(smartphones_clean$has_5g)
# Fit the model; with no family argument glm() fits a Gaussian linear model,
# which is enough here because VIF only reflects how the predictors relate to each other
fit <- glm(has_5g ~ rating + price + has_nfc + has_ir_blaster + num_cores +
             processor_speed + battery_capacity + ram_capacity +
             internal_memory + screen_size +
             primary_camera_rear + primary_camera_front +
             resolution_width + resolution_height,
           data = smartphones_clean)
# Calculate VIF
vif_values <- vif(fit)
print(vif_values)
##               rating                price              has_nfc
##            11.457576             2.217745             1.582151
##       has_ir_blaster            num_cores      processor_speed
##             1.215161             1.177892             1.793486
##     battery_capacity         ram_capacity      internal_memory
##             1.084197             4.082950             2.685149
##          screen_size  primary_camera_rear primary_camera_front
##             1.165406             2.386553             1.688200
##     resolution_width    resolution_height
##             1.387191             2.161924
# Convert data back
smartphones_clean <- smartphones_clean %>%
  mutate(has_5g = ifelse(has_5g == 2, "Yes", "No"))
smartphones_clean$has_5g <- factor(smartphones_clean$has_5g)
As we can see, most of the VIF values are low, except for
rating, which has a high value of 11.457576.
Therefore, we will not include this variable in our recipe. We also
have to make sure our response variable is converted back to a factor
with levels “Yes” and “No”.
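To make the VIF numbers less of a black box, here is a small sketch of the underlying formula, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The simulated predictors x1, x2, and x3 are hypothetical and only serve to show one collinear pair; this is not how the values above were produced.

```r
# Hypothetical simulated predictors; x3 is built to be nearly collinear with x1
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.2)
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others
manual_vif <- sapply(names(predictors), function(var) {
  others <- setdiff(names(predictors), var)
  r2 <- summary(lm(reformulate(others, response = var), data = predictors))$r.squared
  1 / (1 - r2)
})
print(round(manual_vif, 2))
```

Here x1 and x3 receive large VIFs because each is almost fully explained by the other, while the independent x2 stays near 1.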
The distribution of rating plot provides a clear visualization of how frequently each rating value appears among the smartphones in the dataset. By segmenting the data based on 5G connectivity, we can observe distinct patterns in how 5G-capable smartphones are rated compared to those without 5G.
smartphones_clean %>%
  dplyr::select('rating', 'has_5g') %>%
  dplyr::mutate(rating = cut(rating,
                             breaks = seq(min(rating), max(rating), by = 1),
                             include.lowest = TRUE)) %>%
  group_by(rating) %>%
  na.omit(rating) %>%
  ggplot(aes(rating)) +
  geom_bar(aes(fill = has_5g), color = "black") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_manual(values = c("#AEE1F0", "#A8E9BE")) +
  labs(
    title = "Distribution of Rating",
    x = "Rating",
    y = "Count"
  )
From the plot, it is evident that higher ratings are more commonly associated with smartphones that have 5G connectivity. This trend suggests that consumers tend to rate 5G-enabled smartphones more favorably, likely due to the enhanced performance, faster data speeds, and advanced features that 5G technology offers. However, the plot also shows that there are still numerous smartphones without 5G connectivity that receive high ratings. This indicates that while 5G is a desirable feature, it is not the sole determinant of a smartphone’s quality or consumer satisfaction.
Our next graph offers a comprehensive visualization of the frequency at which each brand name appears in the dataset, segmented by 5G connectivity.
smartphones_clean %>%
  ggplot(aes(x = brand_name, fill = has_5g)) +
  geom_bar(position = "dodge", color = "black") +
  labs(title = "Distribution of 5G Capability by Brand",
       x = "Brand Name",
       y = "Count",
       fill = "Has 5G") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_manual(values = c("#AEE1F0", "#A8E9BE"))
As we can see from this graph, Samsung and Xiaomi both have high counts of 5G smartphones, suggesting that they are leaders in adopting 5G technology. However, they still have more non-5G models, which could indicate either a focus on more affordable models or an effort to cater to a diverse customer base. Oppo, Poco, and iQOO are brands that have more 5G smartphones than non-5G models, and Honor appears in this dataset only with 5G smartphones. It was interesting to dive deeper into the brands in this dataset, as I had never heard of some of them, and I was also surprised that Apple does not appear in this data!
Now that we have a general idea of how our variables relate to whether a smartphone has 5G connectivity, we can begin our train/test split, create our recipe, and establish cross-validation to help set up our models.
To begin setting up our models, we first need to split our data into
separate datasets: one used for training our models and one used when we
actually test our model at the end. Our very first step is setting our
seed so that the random split will be reproduced every time. Then, we
will perform a split on our data by stratifying on our response
variable, has_5g. I used a split of 75% and 25% to maximize
the data that we have to train the model since we do not have that many
observations.
set.seed(123)
smartphones_split <- initial_split(smartphones_clean, strata = has_5g, prop = 0.75)
smartphones_train <- training(smartphones_split)
smartphones_test <- testing(smartphones_split)
The dimensions of our training set:
## [1] 266 21
The dimensions of our testing set:
## [1] 90 21
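To illustrate what stratifying on the response accomplishes, here is a minimal base R sketch that samples 75% of the rows within each class separately. The toy data frame `toy` is hypothetical, and this is a simplified stand-in for the idea, not the exact procedure initial_split() uses internally.

```r
# Hypothetical toy response with an imbalanced Yes/No mix
set.seed(123)
toy <- data.frame(
  id     = 1:40,
  has_5g = factor(rep(c("No", "Yes"), times = c(28, 12)))
)

# Sample 75% of the row indices within each class, then combine
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$has_5g), function(rows) {
  rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions should match the original 70/30 mix in both pieces
print(prop.table(table(toy_train$has_5g)))
print(prop.table(table(toy_test$has_5g)))
```

Because the sampling happens inside each class, neither piece can end up with too few 5G phones by chance, which matters when there are not many observations.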
After removing unnecessary variables, we can get a better
understanding of what each remaining variable is. The variables that I
kept in the final data set, and from which the model recipe to predict
5G connectivity is built, are:
brand_name: The name of the smartphone’s manufacturer
price: The cost of the smartphone in the specified currency
rating: The user or expert rating of the smartphone on a scale from 0-100
has_5g: Indicates whether the smartphone supports 5G connectivity (Yes/No)
has_nfc: Indicates whether the smartphone has Near Field Communication (NFC) capability (1 for Yes, 0 for No)
has_ir_blaster: Indicates whether the smartphone is equipped with an infrared blaster (1 for Yes, 0 for No)
processor_brand: The brand of the smartphone’s processor
num_cores: The number of cores in the smartphone’s processor
processor_speed: The clock speed of the smartphone’s processor, usually measured in GHz
battery_capacity: The capacity of the smartphone’s battery
fast_charging: The power of the fast charging capability of the smartphone
ram_capacity: The amount of Random Access Memory (RAM) in the smartphone
internal_memory: The storage capacity of the smartphone
screen_size: The diagonal size of the smartphone’s screen
refresh_rate: The refresh rate of the smartphone’s screen
num_rear_cameras: The number of cameras on the back side of the smartphone
num_front_cameras: The number of cameras on the front side of the smartphone
primary_camera_rear: The resolution of the primary back camera
primary_camera_front: The resolution of the primary front camera
resolution_width: The width resolution of the smartphone’s screen
resolution_height: The height resolution of the smartphone’s screen
smartphones_recipe <- recipe(has_5g ~ price + has_nfc + has_ir_blaster +
                               processor_speed + battery_capacity +
                               fast_charging + internal_memory + screen_size +
                               refresh_rate + num_rear_cameras +
                               primary_camera_rear + primary_camera_front +
                               resolution_width + resolution_height,
                             data = smartphones_clean) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())
We will use our predictor and response variables to build the recipe
that we will use for all of the models. Essentially, each variable that
we have included will be used to predict our response variable,
has_5g. As explained above, we are not going to include
rating in the recipe. I decided to exclude
brand_name and processor_brand from the recipe
because I wanted to focus on numerical variables. Furthermore,
num_cores and num_front_cameras were not
included because their values were almost all the same.
Lastly, we normalize the variables by centering and scaling to ensure that the data is appropriately prepared for modeling.
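As a rough illustration of what centering and scaling do to a single predictor, the sketch below standardizes a hypothetical price vector in base R. It assumes, as is standard practice, that the mean and standard deviation are estimated on the training values and then reused for new data; the vectors `train_price` and `test_price` are made up.

```r
# Hypothetical toy predictor split into training and testing pieces
train_price <- c(10000, 15000, 20000, 25000, 30000)
test_price  <- c(12000, 28000)

# Estimate the mean and standard deviation from the training values only
train_mean <- mean(train_price)
train_sd   <- sd(train_price)

# Apply the same training statistics to both sets
train_scaled <- (train_price - train_mean) / train_sd
test_scaled  <- (test_price - train_mean) / train_sd

print(round(train_scaled, 3))
print(round(test_scaled, 3))
```

After this transformation the training values have mean 0 and standard deviation 1, so predictors measured on very different scales (price versus screen size, for example) contribute comparably to distance-based models such as KNN and SVM.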
K-fold cross-validation is a technique used in machine learning to assess the performance and generalization ability of a model. It helps to assess how well the model generalizes to new and unseen data by dividing the dataset into k subsets or folds. The model is trained and evaluated k times, each time using a different fold as the test set and the remaining folds as the training set.
We used vfold_cv() to create 10 folds from the training
set. We also stratified the data based on the response variable,
has_5g, which ensures each subset or fold in the
cross-validation process maintains a representative distribution.
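The fold assignment can be sketched in base R as follows. This is a simplified illustration of stratified 10-fold assignment on a hypothetical data frame `toy_train`, not the internals of vfold_cv(): rows of each class are shuffled and dealt out evenly across the folds.

```r
# Hypothetical toy training set with an imbalanced response
set.seed(123)
toy_train <- data.frame(
  id     = 1:50,
  has_5g = factor(rep(c("No", "Yes"), times = c(30, 20)))
)

k <- 10
toy_train$fold <- NA_integer_

# Within each class, shuffle the rows and deal them out across the k folds
for (lvl in levels(toy_train$has_5g)) {
  rows <- which(toy_train$has_5g == lvl)
  shuffled <- rows[sample.int(length(rows))]
  toy_train$fold[shuffled] <- rep(seq_len(k), length.out = length(rows))
}

# Each fold serves once as the assessment set; the rest form the analysis set
for (i in seq_len(k)) {
  assessment <- toy_train[toy_train$fold == i, ]
  analysis   <- toy_train[toy_train$fold != i, ]
  # a model would be fit on `analysis` and scored on `assessment` here
}

print(table(toy_train$fold, toy_train$has_5g))
```

Every fold ends up with the same class mix as the full training set, so each of the ten performance estimates is computed on a representative slice.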
Now it is finally time to build our models!
I chose ROC AUC as my performance metric because it effectively measures how well a binary classification model separates the two classes, especially when the data is not perfectly balanced. ROC AUC is the area under the Receiver Operating Characteristic (ROC) curve, which illustrates the performance of a binary classifier across various discrimination thresholds. This makes the metric particularly valuable for evaluating models without committing to a single classification threshold, providing insight into the trade-off between sensitivity and specificity. A higher ROC AUC value, closer to 1, indicates better performance, suggesting that the model can effectively distinguish between the two classes. A ROC AUC value of 0.5 indicates that the model performs the same as random chance.
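One way to build intuition for ROC AUC is its rank interpretation: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way in base R. The helper name `rank_auc` and the toy labels and probabilities are hypothetical; the project itself relies on yardstick's roc_auc().

```r
# ROC AUC via the rank (Mann-Whitney) formulation:
# the probability that a randomly chosen positive outranks a randomly chosen negative
rank_auc <- function(truth, score, positive) {
  is_pos <- truth == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  ranks  <- rank(score)  # ties receive average ranks
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical labels and predicted probabilities of "Yes"
truth <- c("Yes", "Yes", "No", "Yes", "No", "No")
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2)

auc_value <- rank_auc(truth, prob, positive = "Yes")
print(auc_value)
```

In this toy example, 8 of the 9 possible (positive, negative) pairs are ordered correctly, so the AUC is 8/9. The only misordered pair is the "Yes" scored 0.6 against the "No" scored 0.7.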
I set up models for K-Nearest Neighbors, Logistic Regression, Linear
Discriminant Analysis, Quadratic Discriminant Analysis, Random Forest,
Lasso Regression, Support Vector Machine, and Decision Tree. Each of
these models followed this general process:
1) set up the model by specifying the type of model, setting
the engine, and setting the mode as classification
2) set up the workflow, add the new model, and add our
smartphones recipe
3) set up the tuning grid with the parameters we want and set
ranges for desired levels of tuning
4) tune the model with hyperparameters of our choice
5) select the best-performing model from the tuning grid and
finalize the workflow with the tuning parameters
6) fit that model with our workflow to our smartphones training
data
7) save our results to an RDA file to be loaded back into the
project file
(Steps 3 through 5 are not needed for Logistic Regression, LDA, and
QDA, since they have no hyperparameters to tune.)
Since running our models took up a lot of time, each model was saved into RDA files to then be loaded back in to explore the results.
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_qda.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_lda.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_knn.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_logistic_regression.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_lasso_regression.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_random_forest.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_decision_tree.rda")
load("~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_support_vector_machine.rda")
To visualize our results, we will use the autoplot
function in R, which lets us see how certain parameters affect our
metric of choice: roc_auc.
knn <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
knn_wkflow <- workflow() %>%
  add_model(knn) %>%
  add_recipe(smartphones_recipe)
# Try every neighbor count from 1 to 10
knn_grid <- grid_regular(neighbors(range = c(1, 10)),
                         levels = 10)
knn_fit <- tune_grid(
  knn_wkflow,
  resamples = smartphones_folds,
  grid = knn_grid
)
best_knn <- select_best(knn_fit, metric = "roc_auc")
knn_final <- finalize_workflow(knn_wkflow, best_knn)
knn_final_fit <- fit(knn_final, data = smartphones_train)
save(knn_fit, knn_final_fit,
     file = "~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_knn.rda")
K-Nearest Neighbors is a supervised machine learning algorithm that works off similarity, assuming similar data points are located close to each other in the feature space. In our plot, we saw that ROC AUC increased as the number of nearest neighbors increased. The highest ROC AUC was a little above 0.88, which is already fairly effective. However, it may not be as effective as our other models, since KNN tends to do worse when there are many predictors or dimensions.
library(discrim)  # the MASS engine for discrim_quad() is provided by the discrim package
qda <- discrim_quad() %>%
  set_mode("classification") %>%
  set_engine("MASS")
qda_wkflow <- workflow() %>%
  add_model(qda) %>%
  add_recipe(smartphones_recipe)
qda_fit <- fit(qda_wkflow, smartphones_train)
predict(qda_fit, new_data = smartphones_train, type = "prob")
# fit_resamples() takes its options from control_resamples()
qda_kfold_fit <- fit_resamples(qda_wkflow, smartphones_folds,
                               control = control_resamples(save_pred = TRUE))
collect_metrics(qda_kfold_fit)
smartphones_roc_qda <- augment(qda_fit, smartphones_train)
save(qda_fit, qda_kfold_fit, smartphones_roc_qda,
     file = "~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_qda.rda")
Quadratic Discriminant Analysis is a classification algorithm used for modeling and classifying data. It is similar to Linear Discriminant Analysis (LDA), but it allows each class its own covariance matrix, which produces a non-linear (quadratic) decision boundary between the classes and can make it more accurate when a linear boundary is too restrictive. From the ROC curve for our QDA model, we can see it performed relatively well; compared to our other models, however, its performance was about average.
random_forest <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
rand_forest_wkflw <- workflow() %>%
  add_recipe(smartphones_recipe) %>%
  add_model(random_forest)
rf_grid <- grid_regular(mtry(range = c(2, 14)), trees(range = c(2, 10)),
                        min_n(range = c(2, 8)), levels = 8)
# Tune using ROC AUC
rf_fit_auc <- tune_grid(
  rand_forest_wkflw,
  resamples = smartphones_folds,
  grid = rf_grid,
  metrics = metric_set(yardstick::roc_auc)
)
best_rf_auc <- dplyr::arrange(collect_metrics(rf_fit_auc), desc(mean))
head(best_rf_auc)
best_rf_complex_auc <- select_best(rf_fit_auc, metric = "roc_auc")
rf_final_auc <- finalize_workflow(rand_forest_wkflw, best_rf_complex_auc)
rf_final_fit_auc <- fit(rf_final_auc, data = smartphones_train)
# Tune again using accuracy
rf_fit_res_accuracy <- tune_grid(
  rand_forest_wkflw,
  resamples = smartphones_folds,
  grid = rf_grid,
  metrics = metric_set(accuracy)
)
best_rf_accuracy <- dplyr::arrange(collect_metrics(rf_fit_res_accuracy), desc(mean))
head(best_rf_accuracy)
best_rf_complex_accuracy <- select_best(rf_fit_res_accuracy, metric = "accuracy")
rf_final_accuracy <- finalize_workflow(rand_forest_wkflw, best_rf_complex_accuracy)
rf_final_fit_accuracy <- fit(rf_final_accuracy, data = smartphones_train)
save(rf_fit_auc, rf_final_fit_auc, best_rf_auc,
     rf_fit_res_accuracy, rf_final_fit_accuracy, best_rf_accuracy,
     file = "~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_random_forest.rda")
A random forest model is a supervised machine learning technique
that combines multiple decision trees. Individual decision trees can
overfit the training data, but random forests reduce this problem by
aggregating each tree’s prediction into a final output. In our random
forest, we tuned three different parameters: mtry: the number
of predictors randomly sampled as candidates at each split,
trees: the number of trees grown in the forest, and
min_n: the minimum number of observations a node must contain
in order to be split further.
We see the ROC AUC scores vary depending on the number of trees, but there is a general trend in which more trees lead to higher ROC AUC scores. The optimal minimal node size seems to be 3, with 10 trees and 7 predictors. As the number of predictors increased, so did our accuracy, and as the number of trees increased, the ROC AUC also typically increased. Capping the grid at 10 trees made the relationship between more trees and higher ROC AUC easy to see.
svm <- svm_rbf() %>%
  set_mode("classification") %>%
  set_engine("kernlab")
svm_wkflw <- workflow() %>%
  add_recipe(smartphones_recipe) %>%
  add_model(svm %>% set_args(cost = tune()))
# cost() is on a log2 scale, so this grid spans 2^-10 to 2^5
svm_grid <- grid_regular(cost(range = c(-10, 5)), levels = 10)
# Tune using ROC AUC
svm_fit_auc <- tune_grid(
  svm_wkflw,
  resamples = smartphones_folds,
  grid = svm_grid,
  metrics = metric_set(yardstick::roc_auc)
)
best_svm_auc <- dplyr::arrange(collect_metrics(svm_fit_auc), desc(mean))
head(best_svm_auc)
best_svm_complex_auc <- select_best(svm_fit_auc, metric = "roc_auc")
svm_final_auc <- finalize_workflow(svm_wkflw, best_svm_complex_auc)
svm_final_fit_auc <- fit(svm_final_auc, data = smartphones_train)
# Tune again using accuracy
svm_fit_accuracy <- tune_grid(
  svm_wkflw,
  resamples = smartphones_folds,
  grid = svm_grid,
  metrics = metric_set(accuracy)
)
best_svm_accuracy <- dplyr::arrange(collect_metrics(svm_fit_accuracy), desc(mean))
head(best_svm_accuracy)
best_svm_complex_accuracy <- select_best(svm_fit_accuracy, metric = "accuracy")
svm_final_accuracy <- finalize_workflow(svm_wkflw, best_svm_complex_accuracy)
svm_final_fit_accuracy <- fit(svm_final_accuracy, data = smartphones_train)
save(svm_fit_auc, best_svm_auc, svm_final_fit_auc,
     best_svm_accuracy, svm_final_fit_accuracy,
     file = "~/Documents/PSTAT131/131_FinalProject/RDA/smartphones_support_vector_machine.rda")
Support Vector Machine (SVM) is a supervised machine learning algorithm particularly well-suited for binary classification tasks. In SVM, each observation is plotted as a point in an n-dimensional space, where n is the number of features and each feature supplies one coordinate. The primary objective is to identify a hyperplane that best separates the two classes. This hyperplane is determined by support vectors, which are the data points closest to the decision boundary. Evaluating the Support Vector Machine model in our project, we observed that it performed exceptionally well, with a ROC AUC within about 0.002 of the random forest model.
To summarize the best ROC AUC values, we will create a table that displays the estimated final roc_auc value for each fitted model.
# Note: every ROC AUC below is computed on the training data
knn_auc <- augment(knn_final_fit, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
logreg_auc <- augment(log_reg_fit, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
lda_auc <- augment(lda_fit, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
qda_auc <- augment(qda_fit, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
random_forest_auc <- augment(rf_final_fit_auc, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
lasso_auc <- augment(lasso_final_fit, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
svm_auc <- augment(svm_final_fit_auc, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
decision_tree_auc <- augment(dt_final_fit, new_data = smartphones_train) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
smartphones_roc_aucs <- c(knn_auc$.estimate,
                          logreg_auc$.estimate,
                          lda_auc$.estimate,
                          qda_auc$.estimate,
                          lasso_auc$.estimate,
                          decision_tree_auc$.estimate,
                          random_forest_auc$.estimate,
                          svm_auc$.estimate)
smartphones_mod_names <- c("KNN",
                           "Logistic Regression",
                           "LDA",
                           "QDA",
                           "Lasso",
                           "Decision Tree",
                           "Random Forest",
                           "SVM")
smartphones_results <- tibble(Model = smartphones_mod_names,
                              ROC_AUC = smartphones_roc_aucs)
# Sort from highest to lowest ROC AUC
smartphones_results <- smartphones_results %>%
  dplyr::arrange(desc(ROC_AUC))
smartphones_results
To help visualize these results, we can also use a bar plot.
smartphones_bar_plot <- ggplot(smartphones_results, aes(x = Model, y = ROC_AUC)) +
  geom_bar(stat = "identity", width = 0.2, fill = "#B0E0E6", color = "#20B2AA") +
  labs(title = "Performance of Our Models") +
  theme_minimal() +
  coord_cartesian(ylim = c(0.8, 1))
smartphones_bar_plot
As we can see, Random Forest performed the best overall with a ROC AUC score of 0.9985469. The Support Vector Machine followed close behind at 0.9981982. Since these scores come from the training data, we will next evaluate on the testing data we have yet to use. For this next step we will be moving forward with our Random Forest model.
Now that we have concluded our best model is a Random Forest model, we can continue analyzing its results. Even with the best overall performance, we want to examine how it performs on new data.
Our best model is RF Model 035! This model performed the best overall.
Now, we can use it to fit our testing data and discover its actual performance in predicting if a smartphone possesses 5G connectivity.
# augment() already returns class and probability predictions for classification models
smartphones_rf_roc_auc <- augment(rf_final_fit_auc, new_data = smartphones_test) %>%
  roc_auc(has_5g, .pred_No) %>%
  select(.estimate)
smartphones_rf_roc_auc
We can now find model #035’s true ROC AUC performance on our testing data. The model achieves a ROC AUC score of 0.8879049, which is relatively high! It is lower than the training ROC AUC score of 0.9985469, which indicates that the model has overfit the training data to some degree, but it still demonstrates strong predictive power on unseen data. This score suggests that our model is effective at distinguishing between smartphones with and without 5G connectivity.
rf_roc_curve <- augment(rf_final_fit_auc, new_data = smartphones_test) %>%
  roc_curve(has_5g, .pred_No) %>%
  autoplot()
rf_roc_curve
To visualize our AUC score, we plot our ROC curve. The higher up and to the left the curve is, the better the model’s AUC will be. While our curve does not perfectly resemble a right angle, it still sits toward the top left, which means our model has a relatively high true positive rate and a low false positive rate.
Now it is time to see how effective our model is at predicting whether a smartphone has 5G connectivity. I have collected data from two smartphones in the dataset, one with 5G and one without. We want to see if our model will correctly classify each of them.
smartphone_yes_5g <- data.frame(
  price = 16990,
  has_nfc = 1,
  has_ir_blaster = 0,
  processor_speed = 3.20,
  battery_capacity = 5000,
  fast_charging = 100,
  internal_memory = 256,
  screen_size = 6.70,
  refresh_rate = 120,
  num_rear_cameras = 3,
  primary_camera_rear = 50,
  primary_camera_front = 16.0,
  resolution_width = 1440,
  resolution_height = 3216
)
predict(rf_final_fit_auc, smartphone_yes_5g, type = "class")
We can see that our model correctly classified this smartphone as having 5G connectivity!
smartphone_no_5g <- data.frame(
  price = 9999,
  has_nfc = 0,
  has_ir_blaster = 0,
  processor_speed = 2.30,
  battery_capacity = 5000,
  fast_charging = 10,
  internal_memory = 32,
  screen_size = 6.51,
  refresh_rate = 60,
  num_rear_cameras = 2,
  primary_camera_rear = 13,
  primary_camera_front = 5.0,
  resolution_width = 720,
  resolution_height = 1600
)
predict(rf_final_fit_auc, smartphone_no_5g, type = "class")
Our model correctly predicted the non-5G model. Success!
We can also visualize the performance of our random forest model using a variable importance plot and a confusion matrix.
library(vip)
rf_final_fit_auc %>%
  extract_fit_engine() %>%
  vip(aesthetics = list(fill = "#B0E0E6", color = "#20B2AA"), num_features = 14)
We can see that the most important variables in predicting 5G
connectivity are price, refresh rate, and processor speed, which
makes sense.
final_fit_train_rf <- augment(rf_final_fit_auc,
                              smartphones_test) %>%
  select(has_5g, starts_with(".pred"))
conf_mat(final_fit_train_rf, truth = has_5g,
         .pred_class) %>%
  autoplot(type = "heatmap") + scale_fill_gradient(low = "#E6E6FA")
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
We can see the best-performing Random Forest model has done a good job,
as there are only a handful of misclassifications in the testing set.
It is interesting to note that our model is more prone to false
positives. The top-right square shows the count of smartphones that
do not have 5G connectivity but were incorrectly predicted to have 5G
connectivity (actual has_5g is “No” and predicted .pred_class is
“Yes”).
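For readers who want to see how the cells of a confusion matrix turn into summary numbers, here is a small base R sketch. The `truth` and `pred` vectors are hypothetical and are not the model's actual predictions; the sketch also assumes "Yes" (has 5G) is treated as the positive class.

```r
# Hypothetical true labels and predicted classes, for illustration only
truth <- factor(c("No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No"),
                levels = c("No", "Yes"))
pred  <- factor(c("No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"),
                levels = c("No", "Yes"))

# Rows are predictions, columns are the truth
cm <- table(Prediction = pred, Truth = truth)
print(cm)

# Assumption: "Yes" (has 5G) is treated as the positive class
true_pos  <- cm["Yes", "Yes"]
false_pos <- cm["Yes", "No"]
false_neg <- cm["No", "Yes"]
true_neg  <- cm["No", "No"]

sensitivity <- true_pos / (true_pos + false_neg)  # share of 5G phones caught
specificity <- true_neg / (true_neg + false_pos)  # share of non-5G phones caught
accuracy    <- (true_pos + true_neg) / sum(cm)
```

In this toy example there are two false positives and one false negative, mirroring the pattern described above where the model errs more often by predicting 5G for a phone that lacks it.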
After thorough research and testing, we found that the Random Forest model emerged as the most proficient in predicting smartphone 5G capability. We also discovered that price, refresh rate, and processor speed are important variables when predicting 5G capability.
Looking ahead, potential extensions to this project could involve the application of more advanced techniques such as neural networks. By leveraging the power of neural networks, we could explore more intricate relationships within the data and potentially achieve even higher predictive accuracy. Additionally, integrating image recognition capabilities into the model could open up new possibilities, allowing for the identification of smartphone features directly from product images.
As we continue to explore the intersections of technology and human
experience, we are reminded of the boundless opportunities for
innovation and discovery that lie ahead. I am excited to hopefully
continue pursuing a career in tech!